Abstract

In functional linear regression, parameter estimation involves solving a problem
that is not necessarily well posed, and it has points of contact with a range of
methodologies, including statistical smoothing, deconvolution and projection on
finite-dimensional subspaces. We discuss the standard approach based explicitly
on functional principal components analysis; nevertheless, the choice of the number
of basis components remains subjective and is not always properly discussed
and justified. In this work we discuss the inferential properties of least square
estimation in this context for different choices of projection subspaces, and we
study the asymptotic behaviour as the dimension of the subspaces increases.

Keywords: Functional Regression, Functional Principal Component Analysis,
Asymptotic properties of statistical inference

1 Introduction 

In recent years, applications of regression analysis have increasingly been
concerned with functional data. This is the case, for example, when the
explanatory variables are curves (or digitized points of a curve) linked to a scalar
response variable. This arises, for instance, in chemometrics, where some
chemical variable has to be predicted from a digitized signal such as the Near Infrared
Reflectance (NIR) spectroscopic information (see [?], [?]). Other examples concern
environmental problems, like the prediction of total annual precipitation for Canadian
weather stations from the pattern of temperature variation through the year [?],
or linguistic issues [5], like the analysis of the relationship between log-spectra of
sequences of spoken syllables and phoneme classification [7].

In all these cases, classical regression models for multivariate data may be
inadequate, since the functional nature of the covariates should be exploited using
proper estimation and inferential techniques.

In functional linear regression, parameter estimation involves solving an
ill-posed problem [3] and has points of contact with a range of methodologies,
including statistical smoothing and deconvolution (see, among others, [?] and
references therein). The standard approach to carry out estimation
and inference on regression parameters is based explicitly on functional principal
components analysis (FPCA, see [?] and references therein) and, consequently, on
spectral decomposition in terms of eigenvalues and eigenfunctions. Although FPCA
and analogous projection methods are often effective and straightforward to apply
to the analysis of functional data, the choice of the number of basis components
remains subjective and is not always properly discussed and justified. Even
if several criteria exist to determine the number of basis functions to be selected,
dimensional reduction methods per se do not ensure the proper estimation of the
regression parameters. We show that, given the sub-space identified by the
chosen basis, the classical procedures do not automatically yield an unbiased
estimate either of the true functional coefficient or of its projection on the
corresponding sub-space.

In this work we consider the functional linear model with scalar response. In our
model a real random response $Y$ is linked to a square integrable random function $X$
defined on some compact set $T$ of $\mathbb{R}$, as

\[ y_i = \int_T x_i(t)\,\beta(t)\,dt + \epsilon_i, \qquad i = 1,\dots,n. \tag{1.1} \]

We discuss the choice of suitable finite sub-spaces of $L^2(T)$, called identifiable
sub-spaces, where the least square estimation problem is well posed. We point out the
properties, in terms of bias and variance, of the related estimators. Moreover, we
explain why FPCA turns out to be the optimal solution of a bias-variance
trade off problem when no information is available on the space where the
regression parameters are defined. Finally, we discuss the influence on the parameter
estimates (in terms of bias) of the component orthogonal to the sub-space identified
by the FPC basis, and we provide a simulation study that illustrates the theoretical
results.

The paper is organized as follows: first, the model setting and the estimation of the
functional parameter (Section 2), together with a critical discussion of their
inferential properties (Section 3), are presented for finite dimensional sub-spaces. Then the
large dimensional case is considered (Section 4) and asymptotic results for increasing
size of the sub-space are introduced. Appendices A and B gather some auxiliary
results, while Appendix C contains the setting details of the simulation study. All
the analyses are carried out with R, see [9].

2 Model setting and functional least square estimation

Let us consider the functional model in (1.1),

\[ y_i = \int_T x_i(t)\,\beta(t)\,dt + \epsilon_i, \qquad i = 1,\dots,n, \]

where $\beta \in L^2(T)$, with $T$ a compact set of $\mathbb{R}$, $x_i \in L^2(T)$ and $\epsilon_i \in \mathbb{R}$. We will
consider $\beta$ as a deterministic function and $\epsilon_1,\dots,\epsilon_n$ as random variables independent
of $x_1,\dots,x_n$, with $E[\epsilon_i] = 0$ and $\mathrm{Var}[\epsilon_i] = \sigma^2 > 0$. We assume that we collect the
values of the outcomes $y_1,\dots,y_n$ and observe the data $x_i$ only at
$p$ discrete values $t_1,\dots,t_p \in T$, i.e. the data are $(x_i(t_1),\dots,x_i(t_p),y_i)$ for $i = 1,\dots,n$.
This is the case treated, for example, in [?, 12, ?], among others. To ease notation,
we let $\langle a,b\rangle$ denote the usual inner product in $L^2(T)$, that is $\langle a,b\rangle = \int_T a(t)b(t)\,dt$, and
$\|a\|$ the corresponding norm $\sqrt{\int_T a^2(t)\,dt}$. Accordingly, we can write the model (1.1)
as

\[ y_i = \langle x_i, \beta\rangle + \epsilon_i, \qquad i = 1,\dots,n. \tag{2.1} \]

To obtain the asymptotic results presented in the paper, we assume that $x_1,\dots,x_n$
are i.i.d. realizations of a process $X$ with support $S_X \subset L^2(T)$, zero mean and
bounded second moment, i.e. $E[X] = 0$ and $E[\|X\|^2] < \infty$. In general, neither the
distribution of the random process $X$ nor its support $S_X$ is assumed to be known.
The quantities $\epsilon_1,\dots,\epsilon_n$ model the errors in observing the outcomes $y_1,\dots,y_n$, and so
are assumed to be unknown. The function $\beta$ is unknown and its estimation is the
main focus of this paper.

We need the following setting to describe the functional estimation presented in
the paper: let $S$ be the smallest closed sub-space of $L^2(T)$ such that $S_X \subset S$, and
let $S^\perp$ be the sub-space of $L^2(T)$ orthogonal to $S$, so that

\[ L^2(T) = S \oplus S^\perp. \tag{2.2} \]

In general, the set $S$ may not coincide with $L^2(T)$, which means $S^\perp \neq \{0\}$. For instance,
consider the following process in $L^2(T)$, with $T = [-1,1]$:

\[ X(t) = \sum_{k=0}^{\infty} U_k\,\eta_k\,\varphi_k(t), \tag{2.3} \]

where $\{U_k : k \geq 0\}$ are i.i.d. uniform random variables in $[-1,1]$, $\{\eta_k : k \geq 0\}$ is a
sequence of positive coefficients such that $\sum_{k} \eta_k^2 < \infty$, $\varphi_0(t) = 1/\sqrt{2}$ and
$\varphi_k(t) = \cos(\pi k t)$ for $k \geq 1$.

In this case, the support $S_X$ is composed of the even functions $g \in L^2(T)$ such that
$|\langle g, \varphi_k\rangle| \leq \eta_k$ for any $k \geq 0$. Then, the smallest sub-space of $L^2(T)$
including $S_X$ coincides with the set of the even functions, while the orthogonal space
is represented by the odd functions, i.e.

\[ S := \overline{\mathrm{span}}\{\varphi_k(t);\ k \geq 0\}, \qquad S^\perp := \overline{\mathrm{span}}\{\sin(\pi k t);\ k \geq 1\}. \]

Remark 2.1 It is worth saying that the results on the estimation of $\beta$ presented in
the paper also hold when the model is

\[ y_i = \alpha + \langle x_i, \beta\rangle + \epsilon_i, \qquad i = 1,\dots,n, \]

with $\alpha \in \mathbb{R}$, or when $E[X] \neq 0$. In these cases, the model (2.1) is applied to the
centered data, i.e. $y_i - \bar{y}$ and $x_i - \bar{x}$, where $\bar{y}$ and $\bar{x}$ are the sample means, so that the
asymptotic results are straightforwardly verified.


2.1 Functional least square estimation in finite sub-spaces 

In multivariate regression analysis, a common approach to the problem of
estimating $\beta$ is to compute the least square estimator. However, it is well
known that this approach cannot be straightforwardly generalized to the functional
context, not even when $x_1,\dots,x_n$ are entirely observed for any $t \in T$. In fact,
the extension of the least square estimator to the functional framework would be

\[ \hat\beta_n := \arg\min_{b \in L^2(T)} f(b) = \arg\min_{b \in L^2(T)} \left\{ \sum_{i=1}^{n} \left( y_i - \langle x_i, b\rangle \right)^2 \right\} \tag{2.4} \]

and it is trivial to note that for any $n \in \mathbb{N}$, $x_1,\dots,x_n \in L^2(T)$ and $y_1,\dots,y_n \in \mathbb{R}$ there
exist infinitely many functions $b \in L^2(T)$ such that $f(b) = 0$, even for $S = L^2(T)$. Then, the
estimator $\hat\beta_n$ can never be well defined by following the least square approach (2.4).
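
As a discretized illustration of this non-uniqueness (a sketch, not the functional setting itself), the snippet below replaces $L^2(T)$ with a grid of $p = 3$ points and the inner product with a plain dot product. The curves, the responses and the helper name `f` are made up for the example. Any multiple of a vector orthogonal to all the observed curves can be added to an exact fit without changing the criterion.

```python
def f(b, xs, ys):
    """Least square criterion sum_i (y_i - <x_i, b>)^2 on a discrete grid."""
    return sum(
        (y - sum(xj * bj for xj, bj in zip(x, b))) ** 2
        for x, y in zip(xs, ys)
    )

# Hypothetical data: n = 2 discretized curves observed at p = 3 grid points.
xs = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
ys = [2.0, 3.0]

b0 = [2.0, 3.0, 0.0]       # one coefficient vector that fits the data exactly
null = [1.0, 1.0, -1.0]    # orthogonal to both curves: <x_i, null> = 0

# Shifting b0 along `null` produces different candidates with the same fit.
candidates = [
    [bj + c * nj for bj, nj in zip(b0, null)]
    for c in (-5.0, 0.0, 1.0, 10.0)
]

for b in candidates:
    print(b, f(b, xs, ys))
```

In the functional problem the set of such directions is infinite dimensional, which is why no amount of data makes the unrestricted minimizer unique.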


However, a least square estimator of $\beta$ can be computed in a finite sub-space $D$
of $L^2(T)$. In fact, let $D \subset L^2(T)$ be a sub-space where the data $x_i$ are reconstructed
from their discrete observations $(x_i(t_1),\dots,x_i(t_p))$ by classical smoothing techniques,
so obtaining $x_i^D \in D$. In what follows, we will simply assume that $x_i^D$ represents the
projection of $x_i$ on $D$, and in particular that

\[ \langle x_i^D, g\rangle = \langle x_i, g\rangle, \qquad \forall g \in D. \tag{2.5} \]

Given (2.5), the following minimization problem is, under mild conditions, well
posed:

\[ \hat\beta_n^D := \arg\min_{b \in D} f(b) = \arg\min_{b \in D} \left\{ \sum_{i=1}^{n} \left( y_i - \langle x_i, b\rangle \right)^2 \right\} \tag{2.6} \]

and it can be computed exactly since from (2.5) the real function $x_i \in S$ can be
replaced in (2.6) by its reconstruction $x_i^D \in D$. First, note that if $d := \dim(D)$
is greater than the sample size $n$, or than the number $p$ of observation points,
the solution of (2.6) is not unique, as in (2.4), which gives the condition
$d \leq \min\{n;p\}$. Moreover, if there exists a nonzero $\beta_0 \in D \cap S^\perp$ then $\langle x_i, \beta^D + \beta_0\rangle = \langle x_i, \beta^D\rangle$
for any $\beta^D \in D$, which implies that the minimum is not unique and so $\hat\beta_n^D$ is not
well defined. From (2.5) we have that the same situation occurs when we replace
$x_i$ with its reconstruction $x_i^D$. To avoid this problem, we introduce the following
concept.

We call identifiable any sub-space $D$ such that $D \cap S^\perp = \{0\}$. We recall that $S$
and $S^\perp$ are determined by $S_X$, which is in general unknown. Then, the statistician
has the important role of choosing a sub-space with no components orthogonal
to the sample data, which are formally the components lying in $S^\perp$.
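
The following sketch illustrates why a sub-space containing an element of $S^\perp$ is not identifiable, using even curves on $T = [-1,1]$ as data. The particular curve `x`, the candidate `b`, the shift size and the midpoint-rule helper `inner` are all assumptions made for the illustration. Adding an odd function to `b` leaves every inner product with an even curve unchanged, so the least square criterion cannot distinguish the two candidates.

```python
import math

def inner(g, h, m=2000):
    """Midpoint-rule approximation of <g, h> = int_{-1}^{1} g(t) h(t) dt."""
    step = 2.0 / m
    nodes = (-1.0 + (j + 0.5) * step for j in range(m))
    return step * sum(g(t) * h(t) for t in nodes)

def x(t):
    # A hypothetical even curve, i.e. an element of S.
    return 0.7 / math.sqrt(2) - 0.3 * math.cos(math.pi * t) + 0.1 * math.cos(2 * math.pi * t)

def b(t):
    # A candidate coefficient function.
    return 1.0 + math.cos(math.pi * t)

def beta0(t):
    # An odd function, i.e. an element of S^perp.
    return math.sin(math.pi * t)

def b_shift(t):
    # The same candidate shifted along S^perp.
    return b(t) + 5.0 * beta0(t)

print(inner(x, beta0))                   # numerically zero
print(inner(x, b), inner(x, b_shift))    # the two values agree
```

If $D$ contained `beta0`, both `b` and `b_shift` could be minimizers of (2.6), which is exactly the situation the identifiability condition rules out.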

It is worth highlighting that estimating $\beta$ in a finite sub-space $D$ is intrinsically
a consequence of the reconstruction procedure of the data $x_i$ on $D$. In fact, if we
consider the problem (2.4) computed with the reconstructed data $x_i^D \in D$, it is easy
to see that for any $b_1, b_2$ such that $(b_1 - b_2) \in D^\perp$, we have $f(b_1) = f(b_2)$, so the
solution cannot be unique. Hence, uniqueness can be obtained only by restricting
the problem to the sub-space where the data have been reconstructed, as in (2.6),
with $D$ an identifiable sub-space.

Moreover, in some applications, the physical context of the problem may provide
prior information on $\beta$, so that searching for a solution in a specific sub-space $D$ may
be the most sensible choice. In this case, even if the data $x_i$ are perfectly recorded
at any $t \in T$, the problem (2.6) would only consider their projection on $D$, since the
part on $D^\perp$ is useless. In fact, from (2.6) the components of the data $x_i$ orthogonal
to $D$ are irrelevant, because $b^D \in D$ and the orthogonal part vanishes in the scalar
product $\langle x_i, b^D\rangle$. Then, the strategy of searching for $\beta \in D$ through (2.6) suggests
reconstructing the data on $D$.

In practice, the a priori information on $\beta$ may not be enough to determine a
finite sub-space $D$ to which $\beta$ belongs. Then, the sub-space $D$ is typically chosen
to reconstruct the data $x_1,\dots,x_n$ as well as possible, and so in general
the true $\beta$ may not lie in that sub-space $D$. In this case, it is not clear what $\hat\beta_n^D$
defined in (2.6) is actually estimating, and what its statistical properties are. In
the following section, we provide an answer to this issue. For instance, we will show
that, in general, the least square estimator $\hat\beta_n^D$ computed on $D$ does not converge
to the projection of the real $\beta$ on $D$, as one may expect. Moreover, we will discuss
the collinearity effects in the estimation of $\beta$, which play a central role in the
unbiasedness and consistency of the estimator $\hat\beta_n^D$.


3 Properties of Least Square Estimator in finite sub-spaces

To investigate the statistical properties of $\hat\beta_n^D$, we rewrite (2.6) in a slightly different
way. First, we introduce the projection operator $\pi : D \to S$ of $D$ on $S$, and call
$E \subset S$ the image of $\pi$, i.e.

\[ E := \left\{ x \in S : \exists\, y \in D,\ x = \sum_{k=1}^{\dim(S)} \langle y, \varphi_k^S\rangle\, \varphi_k^S \right\}, \]

where $\{\varphi_k^S;\ k = 1,\dots,\dim(S)\}$ is an orthonormal basis of $S$. Naturally, the definition
of $E$ implies that $D \subset E \oplus S^\perp$.

Appendix A explores more precisely the relation between $D$ and $E$:
for any given $D$ and $S$, we describe how to compute an orthonormal basis for $E$ and
we provide the analytic expression of the projection operator $\pi$. Here, we focus on
the following properties:

(1) since $D \cap S^\perp = \{0\}$ ($D$ is identifiable) and $D$ is finite dimensional, it is possible
to show that $\pi$ is invertible (see Appendix A), so that $\pi$ is a bicontinuous
operator from $D$ to $E$;

(2) for any $b^D \in D$, calling $b^E = \pi(b^D)$, we have that

\[ \langle x_i, b^D\rangle = \langle x_i, b^E\rangle + \langle x_i, b^D - b^E\rangle = \langle x_i, b^E\rangle, \]

because $(b^D - b^E) \in S^\perp$ and $x_i \in S$.

From (1) and (2), we have that $f(b^D) = f(b^E)$ for any $b^D \in D$, so that the element
of $D$ that minimizes $f$ is univocally associated through the projection $\pi$ with the
element of $E$ minimizing $f$. Hence, the least square estimator computed by minimizing
$f$ in (2.6) can be obtained as

\[ \hat\beta_n^D = \pi^{-1}(\hat\beta_n^E), \tag{3.1} \]

where

\[ \hat\beta_n^E := \arg\min_{b \in E} f(b) = \arg\min_{b \in E} \left\{ \sum_{i=1}^{n} \left( y_i - \langle x_i, b\rangle \right)^2 \right\}. \tag{3.2} \]

Then, in the following, we study the statistical properties of $\hat\beta_n^E$ to describe the
behavior of the estimator $\hat\beta_n^D$ computed in $D$, which is the sub-space chosen by
the experimenter as mentioned before. The problem (3.2) is solved in Subsection 3.1,
where the properties of $\hat\beta_n^E$ are investigated. After that, a detailed analysis of the
behavior of $\hat\beta_n^D$ is given in Subsection 3.2.

Finally, for the sake of simplicity, we define the sub-space $F = S \cap E^\perp$, so that we
replace (2.2) with the following expression

\[ L^2(T) = E \oplus F \oplus S^\perp. \tag{3.3} \]

Then, a unique orthogonal decomposition can be realized for any $\beta \in L^2(T)$:

\[ \beta = \beta^E + \beta^F + \beta^{S^\perp}, \tag{3.4} \]

where $\beta \in D$ implies $\beta^F = 0$, since $D \subset E \oplus S^\perp$.

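3.1 Characterization of the least square estimator in E

In this section, we focus on solving (3.2) and we obtain the main properties of
$\hat\beta_n^E$. Given any orthonormal bases for $D$ and $S$, denoted by $\{\varphi_k^D;\ k = 1,\dots,d\}$ and
$\{\varphi_k^S;\ k = 1,\dots,\dim(S)\}$ respectively, we can compute the orthonormal basis for $E$
and we denote it as $\boldsymbol\varphi^E(t) := \{\varphi_k^E;\ k = 1,\dots,d\}$ (see Appendix A for the details).
Then, we call $x_i^E$ the projection of $x_i$ on $E$ and note that for any $b^E \in E$

\[ \langle x_i, b^E\rangle = \langle x_i^E, b^E\rangle + \langle x_i - x_i^E, b^E\rangle = \langle x_i^E, b^E\rangle, \]

since $(x_i - x_i^E)$ lies in a sub-space orthogonal to $E$. Hence, (3.2) can be solved with
finite dimensional quantities, obtaining $\hat\beta_n^E(t) := (\hat{\boldsymbol\beta}_n^E)^T \cdot \boldsymbol\varphi^E(t)$, where

\[ \hat{\boldsymbol\beta}_n^E := \arg\min_{\mathbf{b}^E \in \mathbb{R}^d} \left\{ (\mathbf{y} - \mathbf{X}^E \mathbf{b}^E)^T (\mathbf{y} - \mathbf{X}^E \mathbf{b}^E) \right\}, \tag{3.5} \]

$\mathbf{y}$ is the $n$-vector of the observed values, $\mathbf{y} = (y_1,\dots,y_n)^T$, and $\mathbf{X}^E$ is the
$n \times d$ matrix with entries $[\mathbf{X}^E]_{ij} = \langle x_i, \varphi_j^E\rangle$. As in the multivariate theory, we can easily
obtain

\[ \hat{\boldsymbol\beta}_n^E = \left( (\mathbf{X}^E)^T \mathbf{X}^E \right)^{-1} (\mathbf{X}^E)^T \mathbf{y}. \tag{3.6} \]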


Now, let us discuss the statistical properties of $\hat\beta_n^E$. Using decomposition (3.4), the
model can be written as

\[ y_i = \langle x_i, \beta\rangle + \epsilon_i = \langle x_i, \beta^E\rangle + \langle x_i, \beta^F\rangle + \epsilon_i \]

for any $i = 1,\dots,n$, since $x_i$ is orthogonal to $\beta^{S^\perp}$. In matrix notation, the last
expression becomes

\[ \mathbf{y} = \langle \mathbf{x}, \beta^E\rangle + \langle \mathbf{x}, \beta^F\rangle + \boldsymbol\epsilon, \]

where $\mathbf{x}(t) = (x_1(t),\dots,x_n(t))^T$, $\boldsymbol\epsilon = (\epsilon_1,\dots,\epsilon_n)^T$ and $\mathbf{y} = (y_1,\dots,y_n)^T$. Since the
dimension of $E$ is finite, we can rewrite the last expression as follows

\[ \mathbf{y} = \mathbf{X}^E \boldsymbol\beta^E + \langle \mathbf{x}, \beta^F\rangle + \boldsymbol\epsilon, \tag{3.7} \]

where $\boldsymbol\beta^E$ is the vector such that $\beta^E(t) = (\boldsymbol\beta^E)^T \cdot \boldsymbol\varphi^E(t)$. Note that the
estimator $\hat{\boldsymbol\beta}_n^E$ is computed in (3.6) only with the data projected on $E$, i.e. $\mathbf{X}^E$; then,
the quantity $\langle \mathbf{x}, \beta^F\rangle$ in (3.7) represents the part of the data which has not been used
to compute $\hat{\boldsymbol\beta}_n^E$, so that the least square estimation approach treats $\langle \mathbf{x}, \beta^F\rangle$ in (3.7)
as part of the independent error $\boldsymbol\epsilon$. Nevertheless, the quantity $\langle \mathbf{x}, \beta^F\rangle$ can be correlated with
$\mathbf{X}^E$, and this correlation plays a central role in the estimation of $\beta$.

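To characterize $\hat{\boldsymbol\beta}_n^E$, we substitute (3.7) in (3.6), obtaining

\begin{align*}
\hat{\boldsymbol\beta}_n^E &= \boldsymbol\beta^E + \left( (\mathbf{X}^E)^T \mathbf{X}^E \right)^{-1} (\mathbf{X}^E)^T \langle \mathbf{x}, \beta^F\rangle + \left( (\mathbf{X}^E)^T \mathbf{X}^E \right)^{-1} (\mathbf{X}^E)^T \boldsymbol\epsilon \\
&= \boldsymbol\beta^E + \boldsymbol\gamma_n + \left( (\mathbf{X}^E)^T \mathbf{X}^E \right)^{-1} (\mathbf{X}^E)^T \boldsymbol\epsilon,
\end{align*}

where $\boldsymbol\gamma_n := \left( (\mathbf{X}^E)^T \mathbf{X}^E \right)^{-1} (\mathbf{X}^E)^T \langle \mathbf{x}, \beta^F\rangle$. Then, conditioning on the data $\mathbf{x}(t)$,
the quantity $\hat{\boldsymbol\beta}_n^E$ presents the following features:

\[ E[\hat{\boldsymbol\beta}_n^E \mid \mathbf{x}] = \boldsymbol\beta^E + \boldsymbol\gamma_n, \qquad \mathrm{Cov}(\hat{\boldsymbol\beta}_n^E \mid \mathbf{x}) = \sigma^2 \left( (\mathbf{X}^E)^T \mathbf{X}^E \right)^{-1}. \tag{3.8} \]

The term $\boldsymbol\gamma_n$ captures the relation between $X$ on $E$ and $X$ on $\beta^F$, see also
Subsection 3.2. Moreover, since $x_1,\dots,x_n$ are i.i.d. realizations of $X$ and $E[\|X\|^2] < \infty$, we
can apply the Strong Law of Large Numbers (SLLN), obtaining

\[ \boldsymbol\gamma_n \xrightarrow[n]{a.s.} \boldsymbol\gamma := \left( E[X^E (X^E)^T] \right)^{-1} E[X^E \langle X, \beta^F\rangle], \tag{3.9} \]

where $X^E := \langle X, \boldsymbol\varphi^E(t)\rangle \in \mathbb{R}^d$. It then follows that $\hat{\boldsymbol\beta}_n^E \to \boldsymbol\beta^E + \boldsymbol\gamma$ almost surely. The
quantity $\boldsymbol\gamma$ has a direct functional representation given by $\gamma(t) = (\boldsymbol\gamma)^T \cdot \boldsymbol\varphi^E(t)$, and
we directly obtain the limit of $\hat\beta_n^E$:

\[ \hat\beta_n^E \xrightarrow[n]{a.s.} \beta^E + \gamma. \tag{3.10} \]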


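Remark 3.1 Note that, since $E[X] = 0$, the bias $\boldsymbol\gamma$ can also be written as

\[ \boldsymbol\gamma = (\Sigma^E)^{-1}\, \mathrm{Cov}(X^E, \langle X, \beta^F\rangle), \tag{3.11} \]

where $\Sigma^E := \mathrm{Cov}(X^E)$. The meaning of $\boldsymbol\gamma$ can be easily seen when it is represented
along the principal components of $X^E$. If we denote with $V^E$ the matrix composed
of the eigenvectors $(\psi_1,\dots,\psi_d)$ of $\Sigma^E$, and if we call $Z_k^E := (X^E)^T \cdot \psi_k = \langle X, \psi_k\rangle$
for $k = 1,\dots,d$, the bias along the $k$-th principal component (i.e. $\delta_k = \langle \gamma, \psi_k\rangle$) can
be expressed as follows

\[ \delta_k = \left( \mathrm{Var}[Z_k^E] \right)^{-1} \cdot \mathrm{Cov}\left[ Z_k^E, \langle X, \beta^F\rangle \right]
 = \left( \frac{\mathrm{Var}[\langle X, \beta^F\rangle]}{\mathrm{Var}[Z_k^E]} \right)^{1/2} \cdot \mathrm{Cor}\left[ Z_k^E, \langle X, \beta^F\rangle \right], \]

which shows how the bias reflects the correlation between $X$ on $E$ and $X$ on $F$.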
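
The limit (3.10) and the bias formula (3.11) can be checked numerically in the simplest case $d = 1$. The sketch below is an illustration under assumptions that are not taken from the paper's simulation study: $S$ is two dimensional with Gaussian principal component scores of variances `lam1` and `lam2`, $E$ is spanned by a direction rotated by a hypothetical angle `theta` away from the first principal component, and $F$ is spanned by the orthogonal direction. The function name `simulate_bias` and all default values are made up for the example.

```python
import math
import random

def simulate_bias(theta, lam1=4.0, lam2=1.0, beta_E=1.0, beta_F=2.0,
                  n=100_000, seed=0):
    """Return (least square estimate on E, theoretical limit beta_E + gamma)."""
    rng = random.Random(seed)
    c, s = math.cos(theta), math.sin(theta)
    sxx = sxy = 0.0
    for _ in range(n):
        # Principal component scores of X (assumed Gaussian here).
        s1 = math.sqrt(lam1) * rng.gauss(0.0, 1.0)
        s2 = math.sqrt(lam2) * rng.gauss(0.0, 1.0)
        x_E = c * s1 + s * s2        # <X, phi^E>
        x_F = -s * s1 + c * s2       # <X, phi^F>
        y = beta_E * x_E + beta_F * x_F + rng.gauss(0.0, 0.5)
        sxx += x_E * x_E
        sxy += x_E * y
    estimate = sxy / sxx             # formula (3.6) with d = 1
    cov = (lam2 - lam1) * s * c      # Cov(<X, phi^E>, <X, phi^F>)
    var = lam1 * c * c + lam2 * s * s
    gamma = beta_F * cov / var       # formula (3.11) with d = 1
    return estimate, beta_E + gamma

print(simulate_bias(math.pi / 4))    # estimate settles near beta_E + gamma
print(simulate_bias(0.0))            # E spanned by a principal component: gamma = 0
```

With `theta = 0` the direction spanning $E$ is a principal component, the two scores are uncorrelated and the estimate targets $\beta^E$ itself. For any other angle the estimate converges to $\beta^E + \gamma$ rather than to $\beta^E$.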

3.2 Discussion on the least square estimation in finite identifiable sub-spaces

We now discuss the behavior of the least square estimator in the finite sub-space $D$,
i.e. $\hat\beta_n^D$. To this aim, we consider the results (3.8) and (3.10) related to $\hat\beta_n^E$, and,
through the relation (3.1), we discuss the properties of $\hat\beta_n^D$. In particular, in this
subsection we focus on the asymptotic behavior of $\hat\beta_n^D$, even if analogous arguments
can be used to describe its bias for fixed $n$. Since $\pi^{-1} : E \to D$ is continuous, the
limit of $\hat\beta_n^D$ can be easily obtained from (3.10):

\[ \hat\beta_n^D \xrightarrow[n]{a.s.} \pi^{-1}(\beta^E + \gamma). \tag{3.12} \]

The real issue here is to understand what this limit represents. The discussion is
structured as follows: we analyze the consistency of the least square estimator $\hat\beta_n^D$
in these different cases

(a) $\beta \in D$;

(b) $\beta \notin D$, but $\beta^F = 0$;

(c) $\beta \notin D$, and $\beta^F \neq 0$.

Case (a): $\beta \in D$. In this situation, we trivially have $\beta^F = 0$ since $D \subset E \oplus S^\perp$;
this implies $\beta = \beta^E + \beta^{S^\perp}$ and $\gamma_n = \gamma = 0$ for any $n \geq 1$ by definition. Moreover,
since $(\beta^E + \beta^{S^\perp}) \in D$, we have that $\pi^{-1}(\beta^E) = \beta^E + \beta^{S^\perp}$. Hence, we obtain

\[ \|\hat\beta_n^D - \beta\| \xrightarrow[n]{a.s.} 0. \]

Then, when the true $\beta$ belongs to the sub-space $D$, the least square estimator on
$D$ is consistent. In Figure 1, third panel, we report 100 independent simulations,
detailed in Appendix C, in which an estimate $\hat\beta_n^D$ is computed for large $n$ and
$\beta \in D$. The pointwise mean of the estimates $\hat\beta_n^D$ (dotted line) is very close to
the true $\beta$ (solid line). This illustrates that the estimator $\hat\beta_n^D$ is unbiased and consistent.

Case (b): $\beta \notin D$, but $\beta^F = 0$. Analogously to case (a), $\beta^F = 0$ implies $\beta =
\beta^E + \beta^{S^\perp}$ and $\gamma_n = \gamma = 0$ for any $n \geq 1$. However, in this case $\pi^{-1}(\beta^E) \neq \beta^E + \beta^{S^\perp}$,
so that

\[ \|\hat\beta_n^D - \beta\| \xrightarrow[n]{a.s.} \left\| \left( \pi^{-1}(\beta^E) - \beta^E \right) - \beta^{S^\perp} \right\|, \]

which means the estimator $\hat\beta_n^D$ is not consistent for $\beta$. The asymptotic bias belongs
to the sub-space orthogonal to the data, i.e.

\[ \left( \pi^{-1}(\beta^E) - \beta^E \right) - \beta^{S^\perp} \in S^\perp. \]

This latter fact can be seen in Figure 1, first and second panels. In fact, in this
simulation the differences between $\beta$ (solid line) and the pointwise means of the estimates
$\hat\beta_n^D$ (dotted lines) are odd functions, i.e. elements of $S^\perp$ in the example.

Since the errors in estimating $\beta$ with $\hat\beta_n^D$ belong to a space which cannot be
explored by the data, the bias can be eliminated only by using a priori information
on $\beta$ to modify the choice of $D$. It is also worth observing that this bias is totally
irrelevant if the interest in estimating $\beta$ is only related to the quantity $\langle x, \beta\rangle$ in the
regression context, because the estimation and inference of the inner product is not
influenced by any component of $\beta$ in $S^\perp$.

Figure 1 (panels: Space $E$; Space $D_{\pi/6}$; Space $D_{\pi/3}$): The solid lines are the true $\beta(t)$, the dashed lines are the projection of $\beta(t)$
in the corresponding spaces, the dotted lines are the pointwise means of the estimated
$\beta(t)$. For further details on the simulation setting see Appendix C.

Figure 2 (panels: Space $E$; Space of Principal Components): The black lines are the true $\beta(t)$, the dashed lines are the projection of $\beta(t)$
in the corresponding spaces, the dotted lines are the pointwise means of the estimated
$\beta(t)$.


In other words, in all the cases of Figure 1 the pointwise means of the estimates
$\hat\beta_n^D$ (dotted lines) only differ in their odd component. Hence, since the data
are even functions, the inference on $\langle X, \beta\rangle$ is equivalent. Summing up, in case (b)
the choice of $D$ does not influence the explanation of the phenomena related to the
regression, but it is relevant when the interest lies in the reconstruction of the true $\beta$.

Case (c): $\beta \notin D$, and $\beta^F \neq 0$. In this case, in general we have that $\gamma_n \neq 0$ for
$n \geq 1$ and $\gamma \neq 0$. The asymptotic distance between $\hat\beta_n^D$ and $\beta$ can be divided into three
orthogonal terms:

\[ \|\hat\beta_n^D - \beta\|^2 \xrightarrow[n]{a.s.} \|\beta^F\|^2 + \|\gamma\|^2 + \left\| \left( \pi^{-1}(\beta^E + \gamma) - (\beta^E + \gamma) \right) - \beta^{S^\perp} \right\|^2, \]

where $\beta^F \in F$, $\gamma \in E$ and

\[ \left( \pi^{-1}(\beta^E + \gamma) - (\beta^E + \gamma) \right) - \beta^{S^\perp} \in S^\perp. \]

In Figure 2, left panel, we report 100 independent simulations in which an estimate
$\hat\beta_n^D$ is computed for large $n$ (see the details of the simulation setting in Appendix C).
Since in case (c) we are mainly interested in the estimation on $S$, Figure 2 considers
$D = E$ and $\beta^{S^\perp} = 0$, so that there is no bias on $S^\perp$.

The bias on the sub-space $F$ is always present in this situation, and it is simply
due to the fact that $\hat\beta_n^D \in D$, which is included in $E \oplus S^\perp$, while $\beta \notin E \oplus S^\perp$ when
$\beta^F \neq 0$. Naturally, this bias also influences the statistical analysis on the outcome
$y$, since the contribution of $\langle X, \beta^F\rangle$ to $y$ is not taken into account.

When the aim of the analysis is to reconstruct only the component of $\beta$ on
a particular sub-space, given by $D$ and the functions orthogonal to the data, i.e.
$E \oplus S^\perp$, the bias on $F$ is not of interest. However, the analysis of the estimation
mainly focuses on the bias on $E$: $\gamma(t)$. This function indicates the asymptotic bias
between $\hat\beta_n^E$ and $\beta^E$. In Figure 2, left panel, $\gamma(t)$ is represented by the difference
between $\beta$ (solid line) and the pointwise mean of the estimated $\beta(t)$ (dotted line).
As mentioned in Subsection 3.1, this bias is due to the fact that the part of the
process $X$ on $E$ can be correlated with the part of $X$ along the component $\beta^F \in F$.
We may say that the bias $\gamma_n(t)$ puts in the estimate given by $\hat\beta_n^E$ the additional
information related to the contribution of $\langle X, \beta^F\rangle$ in computing $y_i$. Then, even
if $\beta^E$ is the closest element of $E$ to $\beta$, the function of $E$ that best reconstructs
$y_i$ from the data is $(\beta^E + \gamma)$, since the contribution $\langle X, \beta^F\rangle$ is not observable. From
Remark 3.1 note that if $E[X] = 0$ and $E$ is spanned by $d$ eigenfunctions of the
covariance structure of $X$ (Karhunen-Loève basis), then $\gamma(t) = 0$. In fact, in this
case there is no information on $F$ contained in $E$, and then $\beta^E$ is also the function
that best reconstructs $y_i$ from the data in $E$. Indeed, in Figure 2, right panel, where $D$
is the sub-space generated by the first principal components of the data, the true
$\beta$ (solid line) and the pointwise mean of the estimated $\beta(t)$ (dotted line) coincide
($\gamma(t) = 0$).


3.3 A bias-variance trade off in the estimation in finite sub-spaces

In this subsection, we highlight an interesting bias-variance trade off concerning
the choice of the sub-space where the least square estimator is computed. Before
introducing this trade off, let us discuss the covariance structure of the estimator
$\hat\beta_n^D$, which we define as $(\boldsymbol\varphi^D(s))^T \cdot \mathrm{Cov}(\hat{\boldsymbol\beta}_n^D) \cdot \boldsymbol\varphi^D(t)$, since $\hat\beta_n^D(t) = (\hat{\boldsymbol\beta}_n^D)^T \cdot \boldsymbol\varphi^D(t)$.
We now use the projection matrix $P$ such that $\hat{\boldsymbol\beta}_n^D = P^{-1}(\hat{\boldsymbol\beta}_n^E)$, which is computed in
Appendix A. Through this operator, we can express the relation between $\mathrm{Cov}(\hat{\boldsymbol\beta}_n^D)$
and $\mathrm{Cov}(\hat{\boldsymbol\beta}_n^E)$ as follows

\[ \mathrm{Cov}(\hat{\boldsymbol\beta}_n^D) = P^{-1}\, \mathrm{Cov}(\hat{\boldsymbol\beta}_n^E)\, (P^{-1})^T. \]

From (A.2) in Appendix A we obtain

\[ \mathrm{Cov}(\hat{\boldsymbol\beta}_n^D) = V_D D_D^{-1/2}\, \mathrm{Cov}(\hat{\boldsymbol\beta}_n^E)\, D_D^{-1/2} V_D^T, \]

where $D_D$ and $V_D$ represent the eigen-structure of $P^T P$, i.e. $P^T P V_D = V_D D_D$.
Denote with $\nu_1^D,\dots,\nu_d^D$ and $\nu_1^E,\dots,\nu_d^E$ the eigenvalues of $\mathrm{Cov}(\hat{\boldsymbol\beta}_n^D)$ and $\mathrm{Cov}(\hat{\boldsymbol\beta}_n^E)$,
respectively. Then, we can observe that

(i) since $P$ is a projection matrix, all the eigenvalues of $P^T P$ are not greater than one.
Hence, all the diagonal elements of $D_D^{-1/2}$ are not smaller than one. So, the variance of the
retro-projection due to $P^{-1}$ is non decreasing in any direction, i.e. $\nu_k^D \geq \nu_k^E$
for any $k = 1,\dots,d$;

(ii) if $D \subset S$, all the eigenvalues are equal to one and the total variance is the
same, i.e. $\nu_k^D = \nu_k^E$ for any $k = 1,\dots,d$;

(iii) if all the eigenvalues in $D_D$ are greater than a value $\epsilon_D > 0$, we can uniformly
control the variance of $\hat{\boldsymbol\beta}_n^D$, i.e. $\nu_k^D \leq 1/\epsilon_D \cdot \nu_k^E$ for any $k = 1,\dots,d$.

From these properties we can distinguish two interesting cases of bias-variance
trade off related to the choice of the sub-space $D$:

(1) Consider all the possible identifiable sub-spaces $D$ with the same projection
$E_0$ on $S$, i.e. $D \subset E_0 \oplus S^\perp$ and $D \cap S^\perp = \{0\}$. From (i) and (ii) we have that
the variance of $\hat\beta_n^D$ is minimized by choosing $D = E_0$, that is $D \subset S$. However,
to reduce the bias on $S^\perp$, some a priori information on $\beta$ may suggest another
choice of $D$. For instance, consider the case $\beta^F = 0$ and $\beta^{S^\perp} \neq 0$: if we choose
$D = E_0$ the variance of $\hat\beta_n^D$ is minimized but we have a bias on $S^\perp$ (case (b)),
while if we choose $D$ such that $\beta \in D$, the estimator $\hat\beta_n^D$ has no bias but the
variance may be very high. Figure 1 describes this situation: when $D = E$
(Figure 1, first panel), the variance of the estimates is low but the pointwise
mean of the estimated $\hat\beta_n^D$ does not target $\beta$; when $D = D_{\pi/6}$ (Figure 1, second
panel), the variance of the estimates increases and the bias decreases; when
$D = D_{\pi/3}$ (Figure 1, third panel), the variance of the estimates is high but
there is no bias.

When the sample size $n$ or the number of discrete observations $p$ is not too
large, we may prefer a small variance even if the estimator is biased. Naturally,
when we have no previous information on $\beta$, there is no chance to control the
bias and the most sensible choice is to minimize the variance by choosing the closest
$D$ to the space of the data $S$.

(2) Consider all the possible sub-spaces $E \subset S$, generated by the projection of $D$
on $S$. To ease notation, take $D \subset S$ (i.e., $D = E$). It is well known that
the variance of the estimator $\hat\beta_n^D$ is smaller when the variance of the data is
higher. Then, the variance of $\hat\beta_n^D$ is minimized when $D$ coincides with the space
generated by the first principal components of $X$ on $S$. However, to reduce
the bias on $F$, some a priori information on $\beta$ may suggest a different choice
of $D$. For instance, taking $\beta^{S^\perp} = 0$, when $D$ is equal to the space generated
by the first principal components (PCs), the variance of $\hat\beta_n^D$ is minimized but
we have a bias on $F$, since in general $\beta^F \neq 0$ (case (c)); nevertheless, when $D$
is such that $\beta \in D$, the estimator $\hat\beta_n^D$ has no bias (case (a)) but the variance
may be very high. Figure 3 describes this situation: in Figure 3, left panel, we
have a space $D$ that includes $\beta$, and so the estimates $\hat\beta_n^D$ of $\beta$ are unbiased
but they show a large variance; in Figure 3, second panel, the space of the first
PCs does not include $\beta$, and so the pointwise mean of the estimated $\hat\beta_n^D$ does
not target $\beta$, but the variance is low.

When the sample size $n$ or the number of discrete observations $p$ is not too
large, we may prefer a small variance even if the estimator is biased. Naturally,
when we have no a priori information on $\beta$, there is no chance to control the
bias and the most sensible choice is to minimize the variance by choosing the closest
$E$ to the space generated by the first PCs.
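
The variance inflation discussed in this subsection can be made explicit in a one-dimensional worked case. This is only a sketch under added assumptions: $d = 1$, and $D$ is spanned by a unit function tilted by a hypothetical angle $\alpha \in [0, \pi/2)$ away from $S$. The symbols $\alpha$, $\varphi^S$ and $\varphi^{\perp}$ are introduced for this illustration only, and $P$ is taken to be the matrix of $\pi$ in the bases of $D$ and $E$; the precise definition is the one given in Appendix A.

```latex
\begin{align*}
\varphi^D &= \cos(\alpha)\,\varphi^S + \sin(\alpha)\,\varphi^{\perp},
  \qquad \varphi^S \in S,\ \varphi^{\perp} \in S^\perp,\
  \|\varphi^S\| = \|\varphi^{\perp}\| = 1, \\
\pi(b\,\varphi^D) &= b\cos(\alpha)\,\varphi^S
  \quad\Longrightarrow\quad
  E = \mathrm{span}\{\varphi^S\},\quad P = \cos(\alpha),\quad
  P^T P = \cos^2(\alpha) \le 1, \\
\hat{\boldsymbol\beta}_n^D &= P^{-1}\hat{\boldsymbol\beta}_n^E
  = \frac{\hat{\boldsymbol\beta}_n^E}{\cos(\alpha)}
  \quad\Longrightarrow\quad
  \mathrm{Cov}(\hat{\boldsymbol\beta}_n^D)
  = \frac{\mathrm{Cov}(\hat{\boldsymbol\beta}_n^E)}{\cos^2(\alpha)} .
\end{align*}
```

With $\alpha = 0$ we have $D \subset S$ and the variance is unchanged. As $\alpha$ approaches $\pi/2$ the sub-space $D$ approaches $S^\perp$ and the inflation factor $1/\cos^2(\alpha)$ diverges, which is the price paid for being able to represent components of $\beta$ lying in $S^\perp$.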


4 Estimation in large dimensional sub-spaces

In this section, we discuss the behavior of the estimator $\hat\beta_n^D$ obtained in (2.6) when
the dimension of $D$ is arbitrarily large. In other words, we want to investigate how
to compute a well-defined estimator for $\beta$ in an infinite dimensional sub-space $D$.
To deal with this case, we express $D$ as the closure of a countable union of finite
sub-spaces $\{D_d,\ d \geq 1\}$, i.e.

\[ D := \overline{\bigcup_{d \geq 1} D_d}, \]

where $D_d \subset D_{d+1}$ for any $d \geq 1$. We denote with $\{\varphi_d^D,\ d \geq 1\}$ the orthonormal
basis of $D$, such that $\{\varphi_k^D,\ k = 1,\dots,d\}$ is an orthonormal basis of $D_d$, for any $d \geq 1$.

Figure 3 (panels: Space $E$; Space of Principal Components): The black line is $\beta(t)$, the dashed line is the projection of $\beta(t)$ in the
corresponding space, the dotted line is the pointwise mean of the estimated $\beta(t)$.


Note that $\dim(D_d) = d$. A basic idea to construct an estimation procedure of $\beta$
in $D$ is to consider the estimators $\{\hat\beta_n^{D_d};\ d \geq 1\}$ computed in the finite dimensional
spaces $\{D_d;\ d \geq 1\}$ by (2.6), and investigate their asymptotic behavior for large $d$.
In fact, from (3.12) we have that $\lim_{n \to \infty} \hat\beta_n^{D_d}$ exists and is finite for any fixed $d \geq 1$;
however, $\hat\beta_n^{D_d}$ can be considered a proper estimator in $D$ for $\beta$ only if the sequence
of the limits $\{\lim_{n \to \infty} \hat\beta_n^{D_d};\ d \geq 1\}$ is convergent when $d \to \infty$.

Since here we consider sub-spaces $D_d$ with arbitrarily large dimension, it is worth
making an important consideration on the estimator $\hat\beta_n^{D_d}$. As mentioned in
Subsection 2.1, a least square estimator for $\beta$ in a finite identifiable sub-space is
well-defined only if both the sample size $n$ and the number of observations per curve $p$ are
at least as large as the dimension of the sub-space itself. Then, $\hat\beta_n^{D_d}$ can be computed only
if $\min\{n;p\} \geq d$; moreover, whenever we let $d$ increase to infinity, we are implicitly
requiring that both $n$ and $p$ diverge at a rate depending on $d$. Therefore, in
all the situations in which $n$ or $p$ cannot increase arbitrarily, the results presented in
this section do not hold.

In the following, we consider a framework analogous to the one presented in
Section 3: for each $d \geq 1$, let $E_d$ be the sub-space obtained by the projection of $D_d$
on $S$, i.e.

\[ E_d := \left\{ x \in S : \exists\, y \in D_d,\ x = \sum_{k=1}^{\dim(S)} \langle y, \varphi_k^S\rangle\, \varphi_k^S \right\}, \]

and let us define

\[ E := \overline{\bigcup_{d \geq 1} E_d}, \qquad F_d := S \cap E_d^\perp, \qquad F := S \cap E^\perp. \]

So, we have that $D \subset E \oplus S^\perp$ and $L^2(T) = E \oplus F \oplus S^\perp$, and any $\beta \in L^2(T)$ has the
following orthogonal decomposition: $\beta = \beta^E + \beta^F + \beta^{S^\perp}$.

4.1 Estimation instability in large dimensional sub-spaces

In this subsection, we show that the limit of the sequence $\{\lim_{n \to \infty} \hat\beta_n^{D_d};\ d \geq 1\}$
may not exist, even when $\beta \in D$. To do this, we discuss an example where the
sequence $\{\lim_{n \to \infty} \hat\beta_n^{D_d};\ d \geq 1\}$ does not converge when $d \to \infty$. Since from (3.12)
$\hat\beta_n^{D_d} \to \beta^{D_d} + \gamma^d$ a.s. as $n \to \infty$ for any $d \geq 1$, where $\gamma^d$ is the asymptotic bias on $E_d$, and
since $\beta^{D_d} \to \beta^D$ as $d \to \infty$, it is sufficient to show that $\{\|\gamma^d\|;\ d \geq 1\}$ is not bounded as
$d$ increases.

Consider a process $X$, with $E[X] = 0$, defined on an infinite dimensional sub-space
$D$, with Karhunen-Loève (K-L) basis $\{\psi_k; k \geq 1\}$ and corresponding eigenvalues
$\{\lambda_k; k \geq 1\}$. The sequence $\{\lambda_k; k \geq 1\}$ is decreasing in $k$ (i.e. $\lambda_{max} = \lambda_1 >
\lambda_2 > \cdots$). Let $\{\varphi_k; k = 1,..,d\}$ be a basis for $E_d$, defined as follows:

$$\varphi_k = \begin{cases} \cos(\theta_k)\psi_k + \sin(\theta_k)\psi_{k+1} & \text{if } k \text{ odd;} \\ \sin(\theta_{k-1})\psi_{k-1} - \cos(\theta_{k-1})\psi_k & \text{if } k \text{ even,} \end{cases} \qquad (4.1)$$


where the sequence $\{\theta_k; k \geq 1\}$ will be chosen appropriately later on. Using
the representation of $\gamma^d$ through the coefficients $\delta^d$ presented in Remark 3.1, it is sufficient to show that
$\{\|\delta^d\|; d \geq 1\}$ is not bounded as $d$ increases. To this aim, note that the K-L basis
of $X$ projected on $E_d$ is $\{\psi_k; k = 1,..,d-1\} \cup \{\varphi_d\}$ for $d$ odd, and $\{\psi_k; k = 1,..,d\}$
for $d$ even. By Remark 3.1 it is easy to see that $\delta_k^d = 0$ for any $k < d$, while $\delta_d^d = 0$
when $d$ is even and

$$\delta_d^d = \frac{Cov\left(\langle X, \varphi_d\rangle, \langle X, \beta^{F_d}\rangle\right)}{Var\left(\langle X, \varphi_d\rangle\right)}$$

when $d$ is odd. Hence, $\|\delta^d\| = 0$ for $d$ even, while $\|\delta^d\| = |\delta_d^d|$ for $d$ odd. This
last term is not zero because of the correlation between the projection of $X$ on $\varphi_d$
(included in $E_d$) and the projection of $X$ on $\varphi_{d+1}$ (included in $F_d$). By writing
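$\beta = \sum_{k \geq 1} \beta_k \psi_k$, and the K-L expansion of $X$ as $\sum_{k \geq 1} Z_k \sqrt{\lambda_k}\,\psi_k$, we obtain

$$\langle X, \varphi_d\rangle = Z_d\sqrt{\lambda_d}\cos(\theta_d) + Z_{d+1}\sqrt{\lambda_{d+1}}\sin(\theta_d),$$

$$\langle X, \beta^{F_d}\rangle = \sum_{k=d+2}^{\infty} Z_k\beta_k\sqrt{\lambda_k} + \rho_d\left(Z_d\sqrt{\lambda_d}\sin(\theta_d) - Z_{d+1}\sqrt{\lambda_{d+1}}\cos(\theta_d)\right),$$

where $\rho_d = \beta_d\sin(\theta_d) - \beta_{d+1}\cos(\theta_d)$. Then, from some easy calculations we have
that

$$|\delta_d^d| = \left(\frac{|\cos(\theta_d)\sin(\theta_d)|\,\mu_d}{\mu_d\cos^2(\theta_d) + 1}\right)\cdot|\rho_d|,$$

where $\mu_d = \lambda_d/\lambda_{d+1} - 1$. Now, consider any sequence $\{\epsilon_d; d \geq 1\}$ such that $|\beta_d|/\epsilon_d \to
\infty$, and take $\theta_d = \pi/2 - \epsilon_d$ and $\lambda_{d+1} = \lambda_d/(1 + \exp(\epsilon_d^{-1}))$, so that for large odd $d$

$$\|\delta^d\| \sim \frac{|\beta_d|}{\epsilon_d\left(1 + \exp(-\epsilon_d^{-1})\right)} \to_d \infty.$$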
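The calculation behind the expression for $|\delta_d^d|$ can be sketched as follows, under the assumption (standard for a K-L expansion, but not restated above) that the scores $Z_k$ are uncorrelated with zero mean and unit variance, and with $\rho_d$ denoting the coefficient $\beta_d\sin(\theta_d) - \beta_{d+1}\cos(\theta_d)$. The last line only substitutes the choices of $\theta_d$ and $\lambda_{d+1}$ made in the example, for $0 < \epsilon_d < \pi/2$.

```latex
\begin{align*}
\mathrm{Cov}\big(\langle X,\varphi_d\rangle, \langle X,\beta^{F_d}\rangle\big)
  &= \rho_d\,(\lambda_d - \lambda_{d+1})\cos(\theta_d)\sin(\theta_d), \\
\mathrm{Var}\big(\langle X,\varphi_d\rangle\big)
  &= \lambda_d\cos^2(\theta_d) + \lambda_{d+1}\sin^2(\theta_d), \\
\delta_d^d
  &= \rho_d\,\frac{\mu_d\cos(\theta_d)\sin(\theta_d)}{\mu_d\cos^2(\theta_d) + 1},
  \qquad \mu_d = \lambda_d/\lambda_{d+1} - 1, \\
|\delta_d^d|
  &= |\rho_d|\,\frac{\sin(\epsilon_d)\cos(\epsilon_d)}{\sin^2(\epsilon_d) + \exp(-\epsilon_d^{-1})}
  \quad\text{for } \theta_d = \tfrac{\pi}{2} - \epsilon_d,\ \mu_d = \exp(\epsilon_d^{-1}).
\end{align*}
```

The fraction in the last line behaves like $1/\epsilon_d$ as $\epsilon_d \to 0$, which is the source of the divergence.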


This concludes the example where the sequence $\{\lim_{n\to\infty}\hat\beta_n^{D_d}; d \geq 1\}$ does not
converge for $d \to \infty$.
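As a numerical illustration of this example, the following Python sketch evaluates $|\delta_d^d|$ for $\theta_d = \pi/2 - \epsilon_d$ and $\lambda_{d+1} = \lambda_d/(1 + \exp(\epsilon_d^{-1}))$. The function name `delta_dd`, the choices $\beta_d = 1/d$ and $\epsilon_d = 1/d^2$, and the simplification $\rho_d = \beta_d$ are assumptions made only for this illustration.

```python
import math

def delta_dd(rho, eps):
    """|delta_d^d| for theta_d = pi/2 - eps and mu_d = exp(1/eps).

    Uses the equivalent form |cos*sin| / (cos^2 + 1/mu_d), which avoids
    overflow of exp(1/eps) for small eps.
    """
    theta = math.pi / 2 - eps
    inv_mu = math.exp(-1.0 / eps)          # 1 / mu_d
    c, s = math.cos(theta), math.sin(theta)
    return abs(rho) * abs(c * s) / (c * c + inv_mu)

# Hypothetical choices: beta_d = 1/d, eps_d = 1/d^2, rho_d taken equal to beta_d.
for d in (11, 101, 1001):
    print(d, delta_dd(1.0 / d, 1.0 / d**2))
```

With these choices $|\beta_d|/\epsilon_d = d$, so the printed values grow roughly linearly in $d$ even though $\beta_d \to 0$.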

4.2 Principal components for estimation in large dimensional sub-spaces

In Subsection 4.1 we have shown that the limit of the sequence $\{\lim_{n\to\infty}\hat\beta_n^{D_d}; d \geq 1\}$
does not exist in general, not even when $\beta \in D$. We now discuss how to introduce an
alternative least square estimator that is well defined in the case of $\beta \in D$. We will denote
this estimator as $\hat\beta_n^{d,k}$, where $n \geq 1$ is the sample size and $d \geq k \geq 1$ are two integer
parameters associated to the dimension of the sub-space. In this subsection, we
show that, when $\beta \in D$, there exists a sequence $k_d \to \infty$ such that the sequence
$\{\lim_{n\to\infty}\hat\beta_n^{d,k_d}; d \geq 1\}$ converges when $d \to \infty$. This will let us consider $\hat\beta_n^{d,k_d}$ as a
proper estimator for $\beta$ when $d$ is large. To obtain this result, we need to assume the
following conditions:

(i) $\beta \in E \oplus S^{\perp}$, that means $\beta^F = 0$;

(ii) $D \subseteq S$, that means $D = E$ and $D_d = E_d$ for any $d \geq 1$.

It is worth highlighting that these conditions are not restrictive and in the literature
they are always assumed to be true. In fact, most of the existing works consider
the limiting space $D$ equal to the space that generates the data, i.e. $D = S$, which
implies both conditions (i) and (ii).


Let $\{\psi_i^d; i = 1,..,d\}$ be the K-L basis of $X$ projected on the sub-space $D_d$, for any
$d \geq 1$, and recall that $\hat\beta_n^{D_d}$ is the least square estimator computed on $D_d$ from (2.6);
then, we define $\hat\beta_n^{d,k}$ as the projection of $\hat\beta_n^{D_d}$ on the sub-space generated by the first
$k$ functions of the K-L expansion in $D_d$, i.e.

$$\hat\beta_n^{d,k}(t) := \sum_{i=1}^{k} \langle\hat\beta_n^{D_d}, \psi_i^d\rangle\,\psi_i^d(t). \qquad (4.2)$$
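In coordinates, the truncation in (4.2) amounts to projecting a coefficient vector onto the leading eigenvectors of a covariance matrix. The sketch below assumes that `coef` holds the coefficients of the least square estimate with respect to an orthonormal basis of $D_d$ and that `cov` is the covariance matrix of the scores of $X$ on the same basis (in practice a sample estimate); both names, and the function `truncate_on_first_k`, are hypothetical.

```python
import numpy as np

def truncate_on_first_k(coef, cov, k):
    """Project `coef` on the span of the k leading eigenvectors of `cov`."""
    eigval, eigvec = np.linalg.eigh(cov)   # eigenvalues in ascending order
    U_k = eigvec[:, ::-1][:, :k]           # k eigenvectors with largest eigenvalues
    return U_k @ (U_k.T @ coef)            # coordinates of the truncated estimate

# Toy usage with a diagonal covariance, so the K-L directions are the axes.
cov = np.diag([3.0, 2.0, 1.0])
coef = np.array([1.0, 2.0, 3.0])
print(truncate_on_first_k(coef, cov, 2))
```

The projection does not depend on the sign chosen for each eigenvector, so no sign convention is needed.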


Analogously, we define $\beta^{d,k}$ and $\gamma^{d,k}$ as the projections of $\beta$ and $\gamma^d$, respectively, on
the sub-space generated by the first $k$ eigenfunctions of $X$ in $D_d$, i.e.

$$\beta^{d,k}(t) := \sum_{i=1}^{k}\langle\beta,\psi_i^d\rangle\,\psi_i^d(t), \qquad \gamma^{d,k}(t) := \sum_{i=1}^{k}\langle\gamma^d,\psi_i^d\rangle\,\psi_i^d(t). \qquad (4.3)$$

Since from (3.12) we have that $\hat\beta_n^{D_d} \to_n \beta^{D_d} + \gamma^d$ a.s., we can project all the terms
on the sub-space generated by $\{\psi_1^d,..,\psi_k^d\}$, obtaining

$$\hat\beta_n^{d,k} \to_n \beta^{d,k} + \gamma^{d,k} \quad a.s.$$

It is trivial to show that $\beta^{d,k} \to \beta^D$ when $d$ and $k$ increase to infinity; then, our
aim is to show that there exists a sequence $k_d \to \infty$ such that

$$\|\gamma^{d,k_d}\| \to_d 0. \qquad (4.4)$$


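To do that, fix $k \leq d$ and consider the coefficients of $\gamma^{d,k}$ with respect to the basis
$\{\psi_1^d,..,\psi_k^d\}$, i.e. $\delta_i^d = \langle\gamma^d,\psi_i^d\rangle$ for $i = 1,..,k$, where from Remark 3.1

$$\delta_i^d = \left(Var[\langle X,\psi_i^d\rangle]\right)^{-1}\cdot Cov\left[\langle X,\psi_i^d\rangle, \langle X,\beta^{F_d}\rangle\right].$$

Then, defining $\lambda_i^d := Var[\langle X,\psi_i^d\rangle]$ and applying the Cauchy-Schwarz inequality, we
obtain that, for any $i = 1,..,k$,

$$(\delta_i^d)^2 \leq (\lambda_i^d)^{-1}\, Var[\langle X,\beta^{F_d}\rangle] \leq \frac{\lambda_{max}}{\lambda_i^d}\,\|\beta^{F_d}\|^2.$$

From (B.1) we have that $\lambda_i^d$ is increasing in $d$, so that $\lambda_i^d \geq \lambda_i^k$. Therefore, for any
$i = 1,..,k$, we have that

$$(\delta_i^d)^2 \leq \frac{\lambda_{max}}{\lambda_i^k}\,\|\beta^{F_d}\|^2,$$

and hence

$$\|\gamma^{d,k}\|^2 = \sum_{i=1}^{k}(\delta_i^d)^2 \leq \sum_{i=1}^{k}\frac{\lambda_{max}}{\lambda_i^k}\,\|\beta^{F_d}\|^2.$$

Moreover, from (B.1) we know that $\lambda_k^k \leq \lambda_i^k$ for any $i \leq k$, and calling $C_k :=
k(\lambda_{max}/\lambda_k^k)$, we obtain

$$\|\gamma^{d,k}\|^2 \leq \sum_{i=1}^{k}\frac{\lambda_{max}}{\lambda_k^k}\,\|\beta^{F_d}\|^2 = C_k\,\|\beta^{F_d}\|^2, \qquad (4.5)$$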


for any fixed $k \geq 1$. Since $\|\beta^{F_d}\| \to_d 0$ because $\beta^F = 0$, we can take a sequence
$k_d \to \infty$ such that $C_{k_d}\|\beta^{F_d}\|^2 \to 0$, so that from (4.5) we get (4.4). As a consequence,
the sequence $\{\lim_{n\to\infty}\hat\beta_n^{d,k_d}; d \geq 1\}$ converges when $d \to \infty$, which lets us
consider $\hat\beta_n^{d,k_d}$ as a proper estimator of $\beta \in D$ for large $d$.
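One explicit way to build such a sequence, given here only as a sketch, is to take $k_d$ as the largest $k \leq d$ with $C_k\|\beta^{F_d}\| \leq 1$: then $C_{k_d}\|\beta^{F_d}\|^2 \leq \|\beta^{F_d}\| \to 0$, and $k_d \to \infty$ because each $C_k$ is finite. The function `choose_k` and the toy values $C_k = k^3$, $\|\beta^{F_d}\| = 1/d$ are hypothetical; in practice neither quantity is known exactly.

```python
def choose_k(C, b_d, d):
    """Largest k <= d with C[k-1] * b_d <= 1 (falls back to k = 1).

    C   : list with C[k-1] = C_k = k * lambda_max / lambda_k^k
    b_d : the norm ||beta^{F_d}||
    """
    best = 1
    for k in range(1, min(d, len(C)) + 1):
        if C[k - 1] * b_d <= 1.0:
            best = k
    return best

# Toy usage: C_k = k^3 and ||beta^{F_d}|| = 1/d (both assumed for illustration).
C = [k**3 for k in range(1, 51)]
for d in (100, 2000):
    k_d = choose_k(C, 1.0 / d, d)
    print(d, k_d, C[k_d - 1] * (1.0 / d) ** 2)
```

The last printed column is the bound $C_{k_d}\|\beta^{F_d}\|^2$ from (4.5), which shrinks while $k_d$ grows.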


Finally, we can write the consistency result for the estimator $\hat\beta_n^{d,k}$, by letting $k$
and $d$ depend on the sample size $n$: under assumptions (i) and (ii), there exists
a sequence $\{d_n; n \geq 1\}$ such that

$$\hat\beta_n^{d,k} \to_n \beta^{D}, \qquad (4.6)$$


where $d = d_n$ and $k = k_{d_n}$ for any $n \geq 1$. Result (4.6) can be written as follows

$$\|\hat\beta_n^{d,k} - \beta\| \to_n \|\beta^{S^{\perp}}\|,$$

which implies that, when $\beta \in D$,

$$\|\hat\beta_n^{d,k} - \beta\| \to_n 0.$$


Remark 4.1 Assumption (i) is essential to consider $\hat\beta_n^{d,k}$ as a proper estimator of
$\beta$. To see this, consider the following example, where (i) fails, i.e. $\beta^F \neq 0$, and
there is no sequence $\{k_d; d \geq 1\}$ such that $\|\gamma^{d,k_d}\|$ is convergent. In particular, let
$\beta = c\cdot\phi$, with $\phi \in F$ and $\|\phi\| = 1$. Then, take a process $X$ defined as follows

$$X = \sum_{d=1}^{\infty} Z_d\sqrt{\lambda_d}\,\varphi_d + \left(\frac{1}{c}\sum_{d=1}^{\infty} Z_d\sqrt{\lambda_d}\right)\phi,$$

where $\{\varphi_d; d \geq 1\}$ is an orthonormal basis of $D$ and $\{Z_k; k \geq 1\}$ are i.i.d. r.v.
with zero mean and unit variance. Then, define the sequence $\{D_d; d \geq 1\}$ as $D_d =
span\{\varphi_1,..,\varphi_d\}$. Hence, for any $k = 1,..,d$ we have that $\delta_k^d = 1$, which implies
$\|\gamma^{d,k_d}\| = \sqrt{k_d} \to_d \infty$ for any divergent sequence $\{k_d; d \geq 1\}$.

A Formal characterization of the sub-space E


This section focuses on computing explicitly the following quantities introduced in
Section 3:

(1) the orthonormal basis of $E$: $\{\varphi_k^E; k = 1,..,d\}$;

(2) the multivariate projection matrix $P$ that transforms the basis
coefficients of elements in $D$ into the basis coefficients of elements in $E$;

(3) the functional projection operator $\pi : D \to E \subseteq S$ of $D$ on $S$.

Let us consider point (1). First, project the basis of $D$ ($\{\varphi_k^D; k = 1,..,d\}$) on $S$, so
obtaining a $\dim(S) \times d$ matrix $A$, where $[A]_{kj} = \langle\varphi_k^S,\varphi_j^D\rangle$. Note that $A$ may have
infinitely many rows if $\dim(S) = \infty$. Then, the basis of $D$ projected on $S$ generates $d$ linearly
independent functions given by $A^T\varphi^S(t)$, that is a basis for $E$. It is easy to show
that $A^T\varphi^S(t)$ are linearly independent since $\varphi_1^D,..,\varphi_d^D$ are, and $D \cap S^{\perp} = \{0\}$. To make
$A^T\varphi^S(t)$ an orthonormal basis for $E$ we do some calculations, obtaining:

$$\varphi^E(t) = V_S D_D^{-1/2} V_D^T A^T \varphi^S(t), \qquad (A.1)$$

where $D_D$ and $V_D$ represent the eigen-structure of $A^TA$ ($A^TAV_D = V_DD_D$) and $V_S$
is an arbitrary $d \times d$ orthonormal matrix that allows the basis of $E$ to be changed;
without loss of generality, we can consider $V_S = I_d$. Note that, except for $V_S$, the
basis $\varphi^E(t)$ is independent of the choice of the bases $\varphi^D(t)$ and $\varphi^S(t)$. It is worth
saying that the eigenvalues in $D_D$ are all strictly positive since $A^TA$ has full rank,
since $\varphi_1^D,..,\varphi_d^D$ are linearly independent. Moreover, the eigenvalues in $D_D$ are all less
than or equal to one since $A$ is a projection operator.


Now, consider point (2). From (A.1) the projection matrix $P$ from $D$ to $E$ can
be defined as

$$P := \langle\varphi^E(t), (\varphi^D(t))^T\rangle = V_S D_D^{-1/2} V_D^T A^T \langle\varphi^S(t), (\varphi^D(t))^T\rangle;$$

since $\langle\varphi^S(t), (\varphi^D(t))^T\rangle = A$ and $V_D^TA^TA = D_DV_D^T$, we obtain

$$P = V_S D_D^{1/2} V_D^T. \qquad (A.2)$$


Note that, using (A.2) we can rewrite (A.1) as

$$\varphi^E(t) = (P^{-1})^T A^T \varphi^S(t).$$

Then, from the vectorial estimate in $E$ given by (3.2), we can obtain the vectorial
estimate in $D$ with $\hat\beta_n^D = P^{-1}(\hat\beta_n^E)$, and finally compute the functional estimate
$\hat\beta_n^D(t) = (\hat\beta_n^D)^T\varphi^D(t)$. This coincides with the solution of (2.6).
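The constructions in (A.1) and (A.2) can be checked numerically in a finite-dimensional stand-in for $L^2(T)$. In the sketch below functions are represented as vectors of $\mathbb{R}^N$, the columns of `Phi_S` and `Phi_D` are orthonormal bases of $S$ and $D$, and we take $V_S = I_d$; the names `Phi_S`, `Phi_D`, `Phi_E` and `basis_of_E` are hypothetical, and the random subspaces are only an illustration.

```python
import numpy as np

def basis_of_E(Phi_S, Phi_D):
    """Return A, P and an orthonormal basis of E (as columns), with V_S = I_d."""
    A = Phi_S.T @ Phi_D                       # [A]_{kj} = <phi_k^S, phi_j^D>
    eigval, V_D = np.linalg.eigh(A.T @ A)     # A^T A V_D = V_D D_D
    P = np.diag(np.sqrt(eigval)) @ V_D.T      # (A.2): P = D_D^{1/2} V_D^T
    # Column form of phi^E(t) = (P^{-1})^T A^T phi^S(t)
    Phi_E = Phi_S @ A @ np.linalg.inv(P)
    return A, P, Phi_E

# Toy usage: S of dimension 5 and D of dimension 3 inside R^8.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
Phi_S = Q[:, :5]
Phi_D, _ = np.linalg.qr(rng.standard_normal((8, 3)))
A, P, Phi_E = basis_of_E(Phi_S, Phi_D)
print(np.round(Phi_E.T @ Phi_E, 10))          # Gram matrix of the basis of E
```

The Gram matrix of `Phi_E` should be the identity, and `Phi_E.T @ Phi_D` should reproduce `P`, in agreement with the definition of $P$ as the matrix of inner products between the two bases.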

Finally, consider point (3). Using the projection matrix $P$ we can define the
functional operator $\pi$ as follows

$$\pi(g) = \left(P\langle g, \varphi^D(t)\rangle\right)^T\varphi^E(t),$$

for any $g \in D$. Then, using (A.2) we can easily obtain

$$\pi(g) = \left(A\langle g, \varphi^D(t)\rangle\right)^T\varphi^S(t). \qquad (A.3)$$


Note that $\pi$ is independent of any choice of basis of $S$, $D$ and $E$. Using (A.3),
once we get the vectorial estimate in $E$ from (3.2), we can immediately compute the
functional estimate $\hat\beta_n^E(t) = (\hat\beta_n^E)^T\varphi^E(t)$, and then obtain the functional estimate
in $D$, i.e. $\hat\beta_n^D = \pi^{-1}(\hat\beta_n^E)$.

B Increasing information property

In this section, we discuss an interesting property concerning the behavior of the 
eigenvalues of the covariance matrix when its dimension increases. 

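Let $\{M^{(n)} = [m_{ij}^{(n)}], n \geq 1\}$ be a sequence of symmetric matrices such that, for
each $n \geq 1$, $M^{(n)}$ is a $n \times n$ matrix with $m_{ij}^{(n)} = m_{ij}^{(n+1)}$ for any $i, j \leq n$. In
other words, $M^{(n)}$ is obtained from $M^{(n+1)}$ by deleting the last row and column. The
eigenvalues are real, and are ordered according to the following general result proved
by Cauchy in [2, p. 187].

Theorem B.1 ([6], p. 125) On the nested sequence $(M^{(n)})_n$ of matrices given
above, denote with $\{\lambda_k^n; k = 1,..,n\}$ the sequence of the ordered eigenvalues of $M^{(n)}$.
Then, for any $n \geq 1$,

$$\lambda_1^{n+1} \geq \lambda_1^n \geq \lambda_2^{n+1} \geq \lambda_2^n \geq \lambda_3^{n+1} \geq \cdots \geq \lambda_n^{n+1} \geq \lambda_n^n \geq \lambda_{n+1}^{n+1}.$$

A direct consequence of the previous theorem is

$$\lambda_i^k \leq \lambda_i^d, \qquad \lambda_k^k \leq \lambda_i^k, \qquad \forall\, i \leq k \leq d. \qquad (B.1)$$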


This result is applied in Section 4.2, where $M^{(n)}$ represents the covariance matrix
of the random vector $(\langle X,\varphi_1\rangle,..,\langle X,\varphi_n\rangle)$. In this context, a direct interpretation
of (B.1) is that the variance of $X$ projected into a subspace increases when further
components are added.
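The interlacing property and its consequence (B.1) are easy to verify numerically on the leading principal blocks of a covariance matrix. The following sketch uses a randomly generated positive semi-definite matrix as a stand-in for the covariance of the scores; the helper names `nested_eigenvalues` and `check_B1` are hypothetical.

```python
import numpy as np

def nested_eigenvalues(M):
    """Decreasingly ordered eigenvalues of the leading m x m blocks, m = 1..n."""
    n = M.shape[0]
    return [np.linalg.eigvalsh(M[:m, :m])[::-1] for m in range(1, n + 1)]

def check_B1(eigs, tol=1e-10):
    """Check lambda_i^k <= lambda_i^d and lambda_k^k <= lambda_i^k for i <= k <= d."""
    n = len(eigs)
    for d in range(1, n + 1):
        for k in range(1, d + 1):
            for i in range(1, k + 1):
                if eigs[k - 1][i - 1] > eigs[d - 1][i - 1] + tol:
                    return False
                if eigs[k - 1][k - 1] > eigs[k - 1][i - 1] + tol:
                    return False
    return True

# Toy usage on a random covariance-like matrix.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
print(check_B1(nested_eigenvalues(B @ B.T)))
```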


C Simulation settings 

The settings of the simulation study presented in Section 3 are the following.

(1) Data $X_i(t)$ and regression coefficient $\beta(t)$ belong to the Hilbert space $L^2(T)$
with $T = [-1, 1]$ closed interval.

(2) The finite dimensional sub-spaces we consider are:

$$E = Span\{1/\sqrt{2},\ \sqrt{5/8}\,(3t^2 - 1),\ \sqrt{9/128}\,(35t^4 - 30t^2 + 3)\}$$

and

$$D_\theta = Span\{\cos(\theta)\,1/\sqrt{2} + \sin(\theta)\sqrt{3/2}\,t,\ \sqrt{5/8}\,(3t^2 - 1),\ \sqrt{9/128}\,(35t^4 - 30t^2 + 3)\},$$

with $\theta \in [0, 2\pi]$.

Observe that $E = D_0$.
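The generators listed in (2) are normalized Legendre polynomials, so they form an orthonormal system in $L^2([-1,1])$ for every $\theta$. A quick numerical check, given as a sketch with hypothetical helper names `basis_D` and `gram`, uses Gauss-Legendre quadrature, which is exact for polynomials of this degree:

```python
import numpy as np

def basis_D(theta):
    """Generators of D_theta as Python callables on [-1, 1]."""
    return [
        lambda t: np.cos(theta) / np.sqrt(2) + np.sin(theta) * np.sqrt(3 / 2) * t,
        lambda t: np.sqrt(5 / 8) * (3 * t**2 - 1),
        lambda t: np.sqrt(9 / 128) * (35 * t**4 - 30 * t**2 + 3),
    ]

def gram(funcs, order=20):
    """Matrix of L^2([-1, 1]) inner products, computed by Gauss-Legendre quadrature."""
    nodes, weights = np.polynomial.legendre.leggauss(order)
    vals = np.array([f(nodes) + 0 * nodes for f in funcs])  # shape (len(funcs), order)
    return (vals * weights) @ vals.T

print(np.round(gram(basis_D(0.0)), 10))   # theta = 0 gives the generators of E
```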

For each $i = 1,...,n$, where $n$ is the sample size (in our examples $n = 500$),

$$X_i(t) = \sum_{j \in J_i} a_j\sqrt{\eta_j}\,\theta_j(t),$$


where $\{\theta_k(t)\} = \{1/\sqrt{2}\} \cup \{\cos(\pi kt), k = 1,...\}$, $a_j$ are randomly sampled from a
uniform distribution $U \sim Unif_{[-10,10]}$, $\eta_1 = 0.01$, $\eta_j = 1/j$ for $j > 1$, and $J_i$ is a
subset of size $Z$ (with $Z$ a Poisson random variable, $Z \sim \mathcal{P}(\lambda)$) of the integers from 1
to $2 \cdot Z$. We set $\lambda = 10$.

Given a function $\beta(t) \in L^2(T)$, the scalar responses $y_1,...,y_n$ are generated as
$y_i = \int_T \beta(t)X_i(t)\,dt + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 1)$. We repeat the estimation procedure
$M = 100$ times.
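A minimal sketch of this data-generating mechanism is given below. Several details are assumptions of the sketch rather than part of the settings above: the curves are taken of the form $\sum_{j \in J_i} a_j\sqrt{\eta_j}\,\theta_j(t)$, the basis is indexed as $\theta_1 = 1/\sqrt{2}$ and $\theta_j(t) = \cos(\pi(j-1)t)$ for $j > 1$, the coefficients $a_j$ are redrawn for every curve, the curves are observed on an equispaced grid of `p` points, and the integral is approximated by the trapezoidal rule. The function names are hypothetical.

```python
import numpy as np

def theta(j, t):
    """Assumed indexing: theta_1 = 1/sqrt(2), theta_j = cos(pi (j-1) t) for j > 1."""
    return np.full_like(t, 1 / np.sqrt(2)) if j == 1 else np.cos(np.pi * (j - 1) * t)

def eta(j):
    """eta_1 = 0.01 and eta_j = 1/j for j > 1."""
    return 0.01 if j == 1 else 1.0 / j

def simulate(beta, n=500, lam=10, p=201, seed=0):
    """Generate curves X_i on a grid of [-1, 1] and responses y_i."""
    rng = np.random.default_rng(seed)
    t = np.linspace(-1.0, 1.0, p)
    X = np.zeros((n, p))
    for i in range(n):
        Z = rng.poisson(lam)
        if Z > 0:                                   # Z = 0 leaves X_i identically zero
            J = rng.choice(np.arange(1, 2 * Z + 1), size=Z, replace=False)
            a = rng.uniform(-10, 10, size=Z)
            for a_j, j in zip(a, J):
                X[i] += a_j * np.sqrt(eta(j)) * theta(j, t)
    f = X * beta(t)
    # Trapezoidal approximation of the integral of beta(t) X_i(t) over T
    integral = np.sum((f[:, 1:] + f[:, :-1]) * np.diff(t), axis=1) / 2
    y = integral + rng.standard_normal(n)           # eps_i ~ N(0, 1)
    return t, X, y

t, X, y = simulate(lambda s: s**4, n=20)
print(X.shape, y.shape)
```

Repeating the call with different seeds would play the role of the $M = 100$ replications.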

In Figure 1 the true $\beta(t)$ is $\beta(t) = t^2 + 2t + 1/3$, in Figure 2 the true $\beta(t)$ is
$\beta(t) = 1_{[-0.5,0.5]}(t)$ and in Figure 3 the true $\beta(t)$ is $\beta(t) = t^4$.